This is a continuation of the preliminary EDA on the data.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import plotly.express as px
from plotly.subplots import make_subplots
from statsmodels.graphics.tsaplots import month_plot
from statsmodels.api import tsa
import plotly.graph_objs as go
import calendar
# Note to self to trim the packages for this EDA, not all of these will be used at this time.
In [2]:
df = pd.read_csv('data/weatherstats_vancouver_hourly_clean.csv')
df.head()
Out[2]:
date_time_local pressure_station pressure_sea wind_dir wind_speed wind_gust relative_humidity dew_point temperature windchill humidex visibility health_index cloud_okta max_air_temp_pst1hr min_air_temp_pst1hr
0 2013-07-01 00:00:00 101.18 101.16 SSE 7 0.0 91 18.2 19.7 0.0 0.0 32200.0 2.9 5.0 19.4 18.5
1 2013-07-01 01:00:00 101.22 101.21 SE 6 0.0 89 17.8 19.6 0.0 0.0 32200.0 3.0 5.0 20.1 18.7
2 2013-07-01 02:00:00 101.26 101.24 E 11 0.0 88 16.7 18.7 0.0 0.0 32200.0 3.0 5.0 19.8 18.0
3 2013-07-01 03:00:00 101.26 101.25 E 4 0.0 84 16.5 19.2 0.0 0.0 32200.0 2.7 5.0 18.5 17.5
4 2013-07-01 04:00:00 101.30 101.28 NNW 5 0.0 87 15.7 17.9 0.0 0.0 32200.0 2.6 5.0 18.8 17.3
In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87648 entries, 0 to 87647
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   date_time_local      87648 non-null  object 
 1   pressure_station     87648 non-null  float64
 2   pressure_sea         87648 non-null  float64
 3   wind_dir             87648 non-null  object 
 4   wind_speed           87648 non-null  int64  
 5   wind_gust            87648 non-null  float64
 6   relative_humidity    87648 non-null  int64  
 7   dew_point            87648 non-null  float64
 8   temperature          87648 non-null  float64
 9   windchill            87648 non-null  float64
 10  humidex              87648 non-null  float64
 11  visibility           87648 non-null  float64
 12  health_index         87648 non-null  float64
 13  cloud_okta           87648 non-null  float64
 14  max_air_temp_pst1hr  87648 non-null  float64
 15  min_air_temp_pst1hr  87648 non-null  float64
dtypes: float64(12), int64(2), object(2)
memory usage: 10.7+ MB
In [4]:
df['date_time_local'] = pd.to_datetime(df['date_time_local'], utc=False)
df = df.set_index('date_time_local')
df.head()
Out[4]:
pressure_station pressure_sea wind_dir wind_speed wind_gust relative_humidity dew_point temperature windchill humidex visibility health_index cloud_okta max_air_temp_pst1hr min_air_temp_pst1hr
date_time_local
2013-07-01 00:00:00 101.18 101.16 SSE 7 0.0 91 18.2 19.7 0.0 0.0 32200.0 2.9 5.0 19.4 18.5
2013-07-01 01:00:00 101.22 101.21 SE 6 0.0 89 17.8 19.6 0.0 0.0 32200.0 3.0 5.0 20.1 18.7
2013-07-01 02:00:00 101.26 101.24 E 11 0.0 88 16.7 18.7 0.0 0.0 32200.0 3.0 5.0 19.8 18.0
2013-07-01 03:00:00 101.26 101.25 E 4 0.0 84 16.5 19.2 0.0 0.0 32200.0 2.7 5.0 18.5 17.5
2013-07-01 04:00:00 101.30 101.28 NNW 5 0.0 87 15.7 17.9 0.0 0.0 32200.0 2.6 5.0 18.8 17.3
In [5]:
df2 = df[['wind_speed', 'wind_gust', 'temperature', 'windchill', 'humidex', 'dew_point']]
In [6]:
# To further analyze the temperature I will resample the data to monthly averages.
df2_monthly = df2.resample("MS").mean()
# Use head() to check that the monthly averages were applied.
df2_monthly.head()
Out[6]:
wind_speed wind_gust temperature windchill humidex dew_point
date_time_local
2013-07-01 13.928763 0.000000 18.457527 0.000000 8.788978 13.483737
2013-08-01 12.366935 0.342742 18.443145 0.000000 8.071237 14.270833
2013-09-01 12.741667 2.688889 15.384444 0.000000 2.008333 13.043194
2013-10-01 9.686828 0.580645 9.302554 0.000000 0.029570 7.311559
2013-11-01 11.680999 3.586685 6.237032 -0.223301 0.000000 4.013454
In [7]:
# This graph will show us the monthly averages over the years.
fig = px.line(df2_monthly, x=df2_monthly.index, y='temperature',)
fig.update_layout(
    yaxis_title="Degrees", 
    xaxis_title="Year",
    legend_title="", 
    title="Monthly Temperature Average from July 2013 - June 2023"
)
fig.show()

Most years peak in July or August, but August 2022 has the highest monthly average of the whole period, exceeding every other month by at least half a degree Celsius.
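
The observation above can be checked numerically instead of being read off the chart. The following is a minimal sketch that uses a small made-up series in place of `df2_monthly["temperature"]`; the values and the names `monthly_temp`, `top_two` and `margin` are illustrative assumptions, not results from the real data.

```python
import pandas as pd

# Hypothetical monthly means standing in for df2_monthly["temperature"].
idx = pd.date_range("2021-06-01", periods=6, freq="MS")
monthly_temp = pd.Series([17.9, 19.6, 19.1, 14.8, 10.2, 5.1], index=idx)

# The two warmest months, warmest first.
top_two = monthly_temp.nlargest(2)
warmest_month = top_two.index[0]

# How far the warmest month is ahead of the runner-up.
margin = top_two.iloc[0] - top_two.iloc[1]

print(warmest_month.strftime("%B %Y"), round(margin, 2))
```

Running the same three lines on the real monthly series would confirm both the record month and the size of its lead.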

In [8]:
# Now we want to look at the Trend, Seasonal, and Residuals lines for further insights.
decomposition = sm.tsa.seasonal_decompose(df2_monthly["temperature"], model='additive')
In [9]:
# We need to create new columns for these values in our monthly average data set.
df2_monthly["Monthly Trend"] = decomposition.trend
df2_monthly["Monthly Seasonal"] = decomposition.seasonal
df2_monthly["Monthly Residual"] = decomposition.resid
In [10]:
# This graph will plot only these columns for the temperature.
cols = ["Monthly Trend", "Monthly Seasonal", "Monthly Residual"]

fig = make_subplots(rows=3, cols=1, subplot_titles=cols)

for i, col in enumerate(cols):
    fig.add_trace(
        go.Scatter(x=df2_monthly.index, y=df2_monthly[col]),
        row=i+1,
        col=1
    )

fig.update_layout(height=800, width=1200, showlegend=False)
fig.show()

The trend line in the monthly averages still shows the higher temperatures overall in August and July, as is to be expected in the hottest months. Interestingly, the trend does not show a sharp increase in temperature in 2021; it shows that sharp increase in 2015 instead. Overall, the trend shows that while the temperature has increased, it has also leveled out. The trend line does not cover the first six or the last six months, because it is computed as a centered 12-month moving average.

The Seasonal line shows what we already expected: high temperatures in the summer and lower temperatures in the winter.

The Residual line shows no pattern, which we can take as confirmation that the patterns and variability in the data have been captured by the Trend and Seasonal lines.
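
To see why the trend has no values for the first and last six months, here is a sketch of a centered 12-month moving average written with NumPy only. To my understanding this matches the default filter `seasonal_decompose` applies for an even period, but treat that as an assumption; the helper name `centered_ma_trend` and the synthetic series are hypothetical.

```python
import numpy as np
import pandas as pd

def centered_ma_trend(series: pd.Series, period: int = 12) -> pd.Series:
    """Centered moving average for an even period (illustrative sketch)."""
    # Half weights on the two end points keep the window centered on a month.
    weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    core = np.convolve(series.to_numpy(dtype=float), weights, mode="valid")
    # The window needs period/2 months on each side, so the edges stay NaN.
    pad = np.full(period // 2, np.nan)
    return pd.Series(np.r_[pad, core, pad], index=series.index)

# Synthetic series: slow linear warming plus a 12-month cycle.
idx = pd.date_range("2013-07-01", periods=36, freq="MS")
t = np.arange(36)
temps = pd.Series(10 + 0.05 * t + 8 * np.sin(2 * np.pi * t / 12), index=idx)

trend = centered_ma_trend(temps)
print(trend.isna().sum())  # months without a trend estimate
```

On this synthetic input the moving average cancels the 12-month cycle and leaves only the linear part, which is the behaviour we rely on when reading the Trend panel.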

In [11]:
# To look further into the patterns and comparisons by month per each year, we will plot the Seasonal Difference.
df2_monthly["Monthly Seasonal_Difference"] = df2_monthly["temperature"].diff(12)
In [12]:
# The graph will not show the first twelve months: diff(12) subtracts the value from 12 months earlier, so those rows are NaN.
fig = px.line(df2_monthly, x=df2_monthly.index, y="Monthly Seasonal_Difference")

fig.update_layout(
    yaxis_title="Difference (temperature)", 
    xaxis_title="Date",
    title="Change in Monthly Temperature Comparison"
)

fig.show()

Concentrating on the month of June across the years, we can see that June 2021 was, on average, 2.43 degrees warmer than the previous June (the seasonal difference compares each month with the same month a year earlier). No other June shows an increase this large. The highest monthly average in any given year is typically in August. The record high temperature in 2021 appears to be an outlier, but it would be interesting to see what contributed to June 29, 2021 being particularly hot at 15:00. The trend line increase in 2015 could be explained by how much warmer the months of 2015 were than the corresponding months a year earlier. For example, February 2015 was approximately 5 degrees warmer than the previous February.
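
The seasonal difference compares a month with the same month one year earlier. If the goal is a comparison with a "typical" month, an anomaly against the average of that calendar month across all years is an alternative. The sketch below contrasts the two on made-up data; `one_year`, `monthly_normal` and `anomaly` are hypothetical names and the numbers are not from the Vancouver data set.

```python
import numpy as np
import pandas as pd

# Three identical synthetic years of monthly means, January to December.
idx = pd.date_range("2013-01-01", periods=36, freq="MS")
one_year = np.array([4, 5, 7, 10, 13, 16, 18, 18, 15, 10, 6, 4], dtype=float)
temps = pd.Series(np.tile(one_year, 3), index=idx)

# Make one June unusually warm.
june_2015 = pd.Timestamp("2015-06-01")
temps.loc[june_2015] += 3.0

# Year-over-year change: value minus the value 12 months earlier.
yoy = temps.diff(12)

# Anomaly: value minus the mean of that calendar month over all years.
monthly_normal = temps.groupby(temps.index.month).transform("mean")
anomaly = temps - monthly_normal

print(yoy.loc[june_2015], anomaly.loc[june_2015])
```

The two numbers differ because the warm June is itself part of the calendar-month average, while the year-over-year change only looks at the single previous June.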

In [13]:
# Looking at the data on a weekly average may reveal more insights.
df2_weekly = df2.resample("W").mean()
In [14]:
fig = px.line(df2_weekly, x=df2_weekly.index, y='temperature',)
fig.update_layout(
    yaxis_title="Degrees", 
    xaxis_title="Year",
    legend_title="", 
    title="Weekly Temperature Average from July 2013 - June 2023"
)
fig.show()

The weekly averages reach noticeably higher peaks than the monthly averages. That being said, the pattern still shows the summer and winter months as it should. However, the summers of 2021 and 2022 seem to have had weekly average temperatures more than half a degree higher than other summers.

In [15]:
# As we did with the monthly averages, we will review the Trend, Seasonal, and Residual lines for weekly averages.
decomposition = sm.tsa.seasonal_decompose(df2_weekly["temperature"], model='additive')
In [16]:
df2_weekly["Weekly Trend"] = decomposition.trend
df2_weekly["Weekly Seasonal"] = decomposition.seasonal
df2_weekly["Weekly Residual"] = decomposition.resid
In [17]:
cols = ["Weekly Trend", "Weekly Seasonal", "Weekly Residual"]

fig = make_subplots(rows=3, cols=1, subplot_titles=cols)

for i, col in enumerate(cols):
    fig.add_trace(
        go.Scatter(x=df2_weekly.index, y=df2_weekly[col]),
        row=i+1,
        col=1
    )

fig.update_layout(height=800, width=1200, showlegend=False)
fig.show()

The trend line here shows a pattern similar to the monthly averages: temperatures were higher overall in 2015 and leveled off in later years. The temperatures still increased, but not as drastically as in 2015.

The Seasonal and Residual lines in this case do not give us any additional insights.

In [18]:
# Now we move on to see the Seasonal Difference on the weekly average.
df2_weekly["Weekly Seasonal_Difference"] = df2_weekly["temperature"].diff(52)
In [19]:
fig = px.line(df2_weekly, x=df2_weekly.index, y="Weekly Seasonal_Difference")

fig.update_layout(
    yaxis_title="Difference (temperature)", 
    xaxis_title="Date",
    title="Change in Weekly Temperature Comparison"
)

fig.show()

To further support the idea of 2015 having higher temperatures outside of summer, the week of February 8th that year was 11.49 degrees warmer than the same week a year earlier. For the weeks in July and August of 2021 and 2022, the difference ranges from almost nothing to approximately 6.5 degrees.

In the weekly averages, summer temperatures do not differ as much from the corresponding weeks of the previous year. For example, the week that includes June 29th is only 4 degrees higher. The largest differences overall are in the weeks of the winter months. This could indicate that winters have been getting less cold on average, which is interesting to note, as the trend we have heard about in the news is that winters have been getting colder.
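
Whether winters are getting less cold can be tested more directly than by reading week-by-week differences: average December through February for each winter and compare the winters with each other. The sketch below does this on a synthetic daily series; the data and the names `winter_year` and `winter_means` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic daily means with an annual cycle peaking in late July.
idx = pd.date_range("2013-07-01", "2016-06-30", freq="D")
doy = idx.dayofyear.to_numpy()
temps = pd.Series(10 + 8 * np.cos(2 * np.pi * (doy - 205) / 365.25), index=idx)

# Keep only December, January and February.
winter = temps[temps.index.month.isin([12, 1, 2])]

# Label each winter by the year of its January, so that December
# is grouped with the January and February that follow it.
is_december = (winter.index.month == 12).astype(int)
winter_year = winter.index.year.to_numpy() + is_december

winter_means = winter.groupby(winter_year).mean()
print(winter_means)
```

Applied to `df2["temperature"]`, a rising sequence of winter means would support the observation above, while a flat one would suggest the large weekly differences come from a few unusual weeks.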

In [20]:
# As a last step of this part of the analysis, we will now look at daily averages.
df2_daily = df2.resample("D").mean()
In [21]:
fig = px.line(df2_daily, x=df2_daily.index, y='temperature',)
fig.update_layout(
    yaxis_title="Degrees", 
    xaxis_title="Year",
    legend_title="", 
    title="Daily Temperature Average from July 2013 - June 2023"
)
fig.show()

Despite the record high hourly temperature on June 29, 2021, the highest average daily temperature in the past ten years, 26.42, was actually on June 28, 2021. The lowest average daily temperature, -10.85, does line up with that hourly low we mentioned previously on December 27, 2021.
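
The hottest hour and the hottest day need not fall on the same date, because a daily average rewards a day that stays warm overnight. A small sketch with made-up hourly readings shows the effect; the values are illustrative and the variable names are hypothetical.

```python
import pandas as pd

# Two hypothetical days of hourly readings.
idx = pd.date_range("2021-06-28", periods=48, freq=pd.Timedelta(hours=1))
# Day 1 is warm around the clock; day 2 is cooler with one spike at 15:00.
values = [26.0] * 24 + [20.0] * 15 + [38.0] + [20.0] * 8
hourly = pd.Series(values, index=idx)

# Timestamp of the single hottest hourly reading.
hottest_hour = hourly.idxmax()

# Date with the highest daily mean.
hottest_day = hourly.resample("D").mean().idxmax()

print(hottest_hour, hottest_day.date())
```

The same two calls on `df2["temperature"]` would locate the hourly record and the warmest daily average in the real data.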

In [22]:
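# No period is passed, so statsmodels infers it from the daily index frequency (7 days),
# not from the annual cycle. Pass period=365 explicitly to decompose the yearly pattern.
decomposition = sm.tsa.seasonal_decompose(df2_daily["temperature"], model='additive')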
In [23]:
df2_daily["Daily Trend"] = decomposition.trend
df2_daily["Daily Seasonal"] = decomposition.seasonal
df2_daily["Daily Residual"] = decomposition.resid
In [24]:
cols = ["Daily Trend", "Daily Seasonal", "Daily Residual"]

fig = make_subplots(rows=3, cols=1, subplot_titles=cols)

for i, col in enumerate(cols):
    fig.add_trace(
        go.Scatter(x=df2_daily.index, y=df2_daily[col]),
        row=i+1,
        col=1
    )

fig.update_layout(height=800, width=1200, showlegend=False)
fig.show()

The trend line for the daily average is significantly different from that of the weekly and monthly averages. It shows a slightly higher temperature trend in the summer days of 2021 and 2022, as well as a lower temperature trend on the winter days. This could indicate that when the data is grouped in weeks or months, the short-term fluctuations in temperature are flattened out. It is also a consequence of the decomposition settings: with a daily index and no explicit period, the trend is smoothed over only about a week, so it still follows the annual cycle.

It's hard to interpret the Seasonal line in this graph. Because the input is daily averages, it cannot show the rise in temperature during the day and the drop in the evening; with the inferred 7-day period it is a day-of-week pattern, which has no obvious physical meaning for temperature.

The Residual does not show a particular pattern here either, which lines up with the previous observations on the monthly average analysis.
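
The effect of the smoothing window on the daily trend can be illustrated with plain rolling means. As far as I understand, `seasonal_decompose` infers a period of 7 for a daily index when none is given; the sketch below compares a 7-day and a 365-day centered average on a synthetic series with an exact 365-day cycle (an assumption made to keep the example clean).

```python
import numpy as np
import pandas as pd

# Three synthetic years of daily means with a 365-day cycle.
idx = pd.date_range("2013-07-01", periods=3 * 365, freq="D")
t = np.arange(len(idx))
daily = pd.Series(10 + 8 * np.sin(2 * np.pi * t / 365), index=idx)

# A one-week window barely smooths the annual cycle.
trend_7 = daily.rolling(window=7, center=True).mean()

# A one-year window averages the annual cycle away.
trend_365 = daily.rolling(window=365, center=True).mean()

print(round(trend_7.std(), 2), round(trend_365.std(), 2))
```

The week-long window leaves almost the full seasonal swing in the "trend", which is consistent with the daily trend line looking so different from the weekly and monthly ones.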

In [25]:
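# Note: on daily data diff(12) compares each day with the day 12 days earlier.
# A year-over-year comparison would need a lag of about 365 days.
df2_daily["Daily Seasonal_Difference"] = df2_daily["temperature"].diff(12)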
In [26]:
fig = px.line(df2_daily, x=df2_daily.index, y="Daily Seasonal_Difference")

fig.update_layout(
    yaxis_title="Difference (temperature)", 
    xaxis_title="Date",
    title="Change in Daily Temperature Comparison"
)

fig.show()

The daily differences here indicate once more that the summer months, in this case as seen through the daily averages, show similar differences across the years. Because the difference uses a lag of 12 days, this is a short-term comparison rather than a year-over-year one.

Considering the observations of this analysis, it would be interesting to ascertain how the different variables affect temperature and what could be indicators of a particularly high temperature day.
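
The choice of lag matters for the daily seasonal difference. The sketch below contrasts a 12-day and a 365-day lag on a synthetic series with one unusually warm day; the series, the position of the warm day and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Two synthetic years of daily means with a 365-day cycle.
idx = pd.date_range("2013-07-01", periods=2 * 365, freq="D")
t = np.arange(len(idx))
daily = pd.Series(10 + 8 * np.sin(2 * np.pi * t / 365), index=idx)

# Add one unusually warm day in the second year.
warm_day = idx[400]
daily.loc[warm_day] += 5.0

# Lag of 12 days: mostly measures how fast the season is changing.
diff_12_days = daily.diff(12)

# Lag of 365 days: compares each day with the same day a year earlier.
diff_1_year = daily.diff(365)

print(round(diff_1_year.loc[warm_day], 2))
```

With the one-year lag the seasonal cycle cancels and only the unusual day stands out, whereas the 12-day lag is dominated by the ordinary seasonal slope.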